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Abstract 



Background: Hidden Markov models are widely employed by numerous Bioinformatics programs used today. 
Applications range widely from comparative gene prediction to time-series analyses of micro-array data. 
Typically, the parameters of these hidden Markov models have been set up with a specific data set in mind, for 
example the genome of a particular species. Determining a set of parameters that yield a high prediction 
performance and adjusting these parameters for new data sets is typically done manually (if at all) as most 
programs do not incorporate algorithms for automatic parameter training. This manual parameter training can 
be very time-consuming and does not necessarily result in a better performance. Most existing Bioinformatics 
applications risk becoming obsolete unless they can be readily adapted to new data sets. They would thus gain 
a better performance and much wider applicability by incorporating algorithms for automatic parameter training. 
Results: We present a new method for training the parameters of hidden Markov models using sampled state 
paths. As these state paths are sampled from the posterior distribution, we call our new training algorithm 
posterior sampling training. We also introduce a computationally efficient, linear-memory algorithm for posterior 
sampling training. In addition, we introduce the first linear-memory algorithm for Viterbi training. We evaluate 
posterior sampling training by comparing it to Viterbi training and Baum-Welch training for a hidden Markov 
model and find that posterior sampling training outperforms Baum-Welch training in terms of performance and 
computationally efficiency and that it is significantly more robust than the commonly used method of Viterbi 
training. 
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Conclusions: The new parameter training method of posterior sampling training provides a computationally very 
efficient and robust way of training the parameters of hidden Markov models and is also comparatively easy to 
implement. Together with the first linear-memory algorithm for Viterbi training that we present here (and the 
linear-memory algorithm for Baum-Welch training introduced earlier), parameter training can now be attempted 
even for complex models and long training sequences. 



Background 

Hidden Markov models (HMMs) and their variants are widely used for analyzing biological sequence data. 
Bioinformatics applications range from methods for comparative gene prediction (e.g. [1,2]) to methods for 
identifying protein domains (e.g. [3]) and predicting protein interfaces (e.g. [4]), inference models for 
genome-wide association studies (e.g. [5]) and disease association tests for inferring ancestral 
haplotypes (e.g. [6]). 

Most of these Bioinformatics applications have been set up for a specific type of analysis and a specific 
biological data set. The model's states and the implemented prediction algorithms determine the the type 
of analysis that can be performed, whereas the parameters of the HMM are typically chosen to optimize 
the prediction performance for a particular data set. As the quality of the predictions depend on the 
model's parameters, these parameters have to be carefully chosen in order to maximize the prediction 
performance. For example, a gene prediction method whose parameters have been initially chosen to 
predict genes in the mouse genome should not be expected to predict fugu genes with the same 
performance, but its parameters should first be adapted to the fugu genome. As most methods that 
employ HMMs have been set up with a particular data set in mind, they risk becoming obsolete or 
sub-optimal unless their parameters can be readily adapted to new biological data sets. Manually finding a 
good set of parameter values that yields a high prediction performance can be a very time consuming task 
which is not guaranteed to improve the performance. 

Efficient algorithms for automatic parameter training are thus of great importance for adapting existing 
applications to new data sets and for improving their prediction performance, thereby prolonging their 
life-span beyond the data set for which they were initially devised. [7] shows the benefits of having a 
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readily retrainable model for non-comparative gene prediction. He presents the SNAP gene prediction 
program which can be easily adapted to new genomes provided a "bootstrap" set of known gene structures 
exists for parameter training. 

The aim of parameter training is to determine values which optimize the HMM's performance in terms of 
sensitivity and specificity. All existing training algorithms derive parameter values from a set of training 
sequences (the training set). Once the parameters have been determined, the performance of the trained 
HMM can then be evaluated in terms of sensitivity and specificity by analyzing a set of test sequences 
whose annotation we know (the test set) and by comparing the predicted to the known annotation. In 
order for any training procedure to be successful, we thus need three ingredients: (1) a large and diverse 
training set, (2) a large and diverse test set which does not overlap the training set and whose annotation 
we know and and (3) a parameter training algorithm which can analyze the training set in a 
computationally efficient way and derive parameter values that improve the prediction performance. 
There already exist several training algorithms for HMMs. Vitcrbi training [8] is an iterative training 
procedure that derives new parameter values <j> from the observed counts of emissions and transitions in the 
Vitcrbi paths [9] of the training sequences. The iteration stops when the Viterbi paths of the training 
sequences no longer change. Viterbi training is typically easiest to implement as most HMMs generate 
predictions by calculating the Viterbi path. Viterbi training is employed by several Bioinformatics 
applications, e.g. non-comparative methods for predicting genes in fungal genomes [10], where it is 
combined with a Markov chain Monte Carlo strategy) and non-comparative methods for predicting genes 
in prokaryotes [11]. Baum- Welch training [12] is an expectation maximization (EM) algorithm [13]. As the 
Viterbi algorithm, Baum- Welch training is an iterative algorithm. In each iteration, the estimated number 
of counts for each transition and emission is derived by considering all possible state paths in the model 
(rather than the single Viterbi path) for all training sequences X. The iteration is typically stopped after a 
fixed number of iterations or once the change in the log-likelihood is sufficiently small. Baum- Welch 
training is guaranteed to find a local optimum of the likelihood P(X\<j)) [8], whereas Viterbi training finds 
parameter values <f) that maximize the likelihood for both parameter values (p and the Viterbi state paths 
n*, P(X\(j),H*). Baum- Welch training using the forward and backward algorithm [8] is, for example, 
implemented into the prokaryotic gene prediction method EasyGene [14] and the HMM-compilcr 
HMMoC [15]. The result of Viterbi training and Baum- Welch training may strongly depend on the chosen 
set of initial parameter values. We have by now a linear-memory algorithm for Baum- Welch 
training [16, 17] that is not only computationally very efficient, but also relatively easy to implement. 
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We here present the first linear-memory algorithm for Viterbi training as well as a novel, third strategy for 
parameter training. Unlike Viterbi training which considers only a single state path and unlike 
Baum- Welch training which considers all possible state paths for any given training sequence, our new 
training algorithm derives new parameter values from state paths that are sampled from the posterior 
distribution. We call this new training method posterior sampling training. Similar to Viterbi and 
Baum- Welch training, this new algorithm works in an iterative way. In each iteration, a new set of 
parameter values is derived from the transitions and emissions in the sampled state paths for all sequences 
in the training set. The training is stopped after a fixed number of iterations or as soon as the change in 
the log-likelihood is sufficiently small. We also present a linear-memory algorithm for posterior sampling 
training that does not require an entire matrix to be kept in memory. This new and computationally 
efficient algorithm renders parameter training even for complex HMMs and long training sequences 
practical and is also comparatively easy to implement. 

Methods and Results 

Posterior sampling training 

In order to simplify the notation in the following, we will assume without loss of generality that we are 
dealing with a lst-order HMM where the Start state and the End state are the only non-silent states. Our 
description of the existing and the new algorithms easily generalize to higher-order HMMs, HMMs with 
more silent states (provided there exists no circular path in the HMM involving only silent states) and 
n-HMMs which read n un-aligncd input sequences at a time rather than only a single input sequence. 
The HMM is defined by 

• a set of states S = {0,1,..., M}, where state denotes the Start and state M denotes the End state 
and where all other states are non-silent, 

• a set of transition probabilities T = {U,j\i,j £ S}, where tij denotes the transition probability to go 
from state i to state j and J2ies = 1 f° r ever y state i £ S and 

• a set of emission probabilities £ = {ei(y)\i £ S,y £ A}, where e%{y) denotes the emission probability 
of state i for symbol y and X^e-A e i(v) ~ 1 f° r every non-silent state i £ S and A denotes the 
alphabet from which the symbols in the input sequences are derived, e.g. A = {A, C, G, T} we are 
dealing with DNA sequences. 

We also define: 
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• T max is the maximum number of states that any state in the model is connected to, also called the 
model's connectivity. 

• X — {X 1 , X 2 , . . . , X N } denotes the training set of N sequences, where each particular training 
sequence X % of length L % is denoted X % = (x\, x\, . . . , x l Li ). In the following and to simplify the 
notation, we pick one particular training sequence X e X as representative which we denote 

X = (X!,X 2 , ■ ■ -,x L ). 

• II denotes a state path in the HMM. 

For a given HMM and a given input sequence X, the full probability P{X) of the sequence in the HMM 
defines a posterior distribution P(JX\X) over state paths n via the equation: 

P(X,U) 



P(U\X) 



P(X) 



Sampling a state path using the forward algorithm and backtracking 

Before we introduce the new algorithm, we first explain how we can sample a state path from the posterior 
distribution P(II|X) using a combination of existing algorithms [8]. We first calculate all values in the 
two-dimensional forward matrix using the forward algorithm before then invoking a back-tracking 
procedure to sample a state-path. 

• fi(k) denotes the sum of probabilities of all state paths that have read the input sequence X up to 
and including sequence position k and that end in state i, i.e. fi(k) = P(x\, . . . ,Xk, s(xk) — i), where 
s(xk) denotes the state that reads sequence position Xk from input sequence X. We call fi(k) the 
forward probability for sequence position k and state i. 

• Pi(k,m) denotes the probability of selecting state m as the previous state while being in state i at 
sequence position k (i.e. sequence position k has already been read by state i). For a given sequence 
position k and state i, Pi(k,m) defines a probability distribution over previous states as 

The forward matrix is calculated using the following forward algorithm: 

Initialization: at the start of the input sequence, consider all states m e S in the model and set 



/m(0) 



1 m = 
ra/0 
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Recursion: loop over all positions k from 1 to L in the input sequence and loop, for each such sequence 
position k, over all states m £ 5\{0} = {1, . . . , M} and set 

M 

f m (k) = e m (x k ) ■ ^2 f n (k - 1) • t„, m (1) 
Termination: at the end of the input sequence, i.e. for k = L, set 

M 

P{X) = .Jm{L) = ^ fn(L x ) ■ t n ,M 
n=0 

Once we have calculated all forward probabilities fi(k) in the two-dimensional forward matrix, i.e. for all 
states i in the model and all positions k in the input sequence, we can then use the following back-tracking 
procedure to sample a state path from the posterior distribution P(H\X). 

The back-tracking starts at the end of the input sequence, i.e. in k = L, in the End state, i.e. i — M, and 
selects state m as the state for the previous sequence position k — 1 with probability: 

Pi(k, m) = e ^(^)-/»^- 1 )-W (2) 

This procedure is continued until we reach the start of the sequence and the Start state. The resulting 
succession of chosen previous states corresponds to one sampled state path. 

As the back-tracking procedure requires the entire matrix of forward values for all states and all sequence 
positions, the above algorithm for sampling a state path requires O(ML) memory and 0{MT max L) time 
in order to first calculate the matrix of forward values and then O(L) memory and 0{LT max ) time for 
sampling a single state path from the matrix. Note that more state paths can be sampled without having 
to recalculate the matrix of forward values. For sampling K state paths for the same sequence, we thus 
need 0{MT max L + KLT max ) time and 0{ML) memory, if we do not to store the sampled state paths 
themselves. 

Posterior sampling training 

We now describe a novel algorithm for training the parameter of an HMM using state paths sampled from 
the posterior distribution P(U\X). Unlike Viterbi training which considers only a single state path and 
unlike Baum-Welch training which considers all possible state paths for every training sequence, our new 
algorithm derives new parameter values from state paths that are sampled from the posterior distribution 
P(H\X). Similar to Viterbi training and Baum-Welch training, we choose an iterative procedure. In each 
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iteration, we derive a new set of parameter values from one or multiple state paths that we sample from 
the posterior distribution for each training sequence X given the model and its current parameter 

values. The new parameter values thus correspond to the observed relative frequencies of the emissions and 
transitions in the sampled state paths. 

We first describe the new algorithm for training the parameters of an HMM using state paths sampled 
from the posterior distribution. This algorithm corresponds to an iterative procedure. The iterations are 
stopped once a maximum number of iterations have been reached or once the change in the log-likelihood 
is sufficiently small. 
In the following, 

• let El(y, X, TV) denote the number of times that state i reads symbol y from input sequence X in 
state path II given the HMM with parameters from the g-th iteration and 

• let Tfj (X, II) denote the number of times that a transition from state i to state j is used in state 
path II for sequence X given the HMM with parameters from the q-th iteration. 

In the following, the superfix q will indicate from which iteration the underlying parameters of the HMM 
derive. If we consider all N sequences of the training set X = {X 1 , . . . X N } and sample K state paths nj! 
with fee {1, . . . K} for each sequence X n in the training set, the step which updates the values of the 
transition and emission probabilities can be written as: 



n=l l^k=l ^i.jA '^k) /„s 

lJ v;/ ,v: ; vf ; 7;' ; LVMi;;i u 

e q+1 (n) = ElEj ^fc=i E i(y' Xn > n fc) ( 4 \ 

T,y'zA T,n=i EliLi E i(v'> Xn > n 3b) 

Equations (J3j> and Q assume that we know the T^(X n , njj) and Ef(y, X n , U^) values, i.e. that we know 
how often each transition and emission is used in each sampled state path for every training sequence 
X n . 

If our computer has enough memory to use the forward algorithm and the backtracking procedure 
described above, each iteration in the training algorithm would require 0(M maxijLj} + K ^maxi{L.j}) 
memory and 0{MT max JZj—i Li + K JZj—i ^i) time, where X^iLi i s the sum of the N sequence lengths in 
the training set X and maxijLi} the length of the longest sequence in training set X . As we do not have to 
keep the K sampled state paths in memory, the memory requirement can be reduced to C(Mmaxj{Lj}). 
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For many Bioinformatics applications, however, where the number of states in the model M is large, the 
connectivity T max of the model high or the training sequences are long, these memory and time 
requirements are too large to allow automatic parameter training using posterior sampling training. 

A linear-memory algorithm for posterior sampling training 

We now introduce a linear-memory algorithm for posterior sampling training. The idea for this algorithm 
stems from the following observations: 

(1) If we consider the description of the forward algorithm above, in particular the recursion in 
Equation ([lj, we realize that the calculation of the forward values can be continued by retaining only 
the values for the previous sequence position. 

(2) If we have a close look at the description of the backtracking algorithm, in particular the sampling 
step in Equation ([2| , we observe that the sampling of a previous state only requires the forward 
values for the current and the previous sequence position. So, provided we are at a particular 
sequence position and in a particular state, we can sample the state at the previous sequence position 
if we know all forward values for the previous sequence position. 

(3) If we want to sample a state path II from the posterior distribution P(n|X), we have to start at the 
end of the sequence in the End state, see the description above and Equation ^ above. (The only 
valid alternative for sampling state paths from the posterior distribution would be to use the 
backward algorithm [8] instead of the forward algorithm and to start the backtracking procedure at 
the beginning of the sequence in the Start state.) 

Observations (1) and (2) above imply that local information suffices to continue the calculation of the 
forward values (1) and to sample a previous state (2) if we already are in a particular state and sequence 
position, whereas observation (3) reminds us that in order to sample from the correct probability 
distribution, we have to start the sampling at the end of the training sequence. Given these three 
observations, it is not obvious how we can come up with a computationally more efficient algorithm. In 
order to realize that a more efficient algorithm does exist, one also has to note that: 

(4) While calculating the forward values in the memory-efficient way outlined in (1) above, we can 
simultaneously sample a previous state for every combination of a state and a sequence position that 
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we encounter in the calculating of the forward values. This is possible because of observation (2) 
above. 

(5) In every iteration q of the training procedure, we only need to know the values of T^(X, TV) and 
El(y, X, II), i.e. how often each transition and emission appears in each sampled state path II for 
every training sequence X, but not where in the matrix of forward values the transition or emission 
was used. 

Given all observations (1) to (5) above, we can now formally write down a new, linear-memory algorithm 
which calculates T^(X, II) and Ef(y,X, II) in a computationally efficient way. In order to simplify the 
notation, we consider one particular training sequence X — (xi, . . .xl) of length L and omit the superfix 
for the iteration q, as both remain the same throughout the following algorithm. T iy j(k,m) denotes the 
number of times the transition from state i to state j is used in a sampled state path that finishes at 
sequence position k in state m and Ei(y, k) denotes the number of times state i read symbol y in a state 
path that finishes at sequence position k in state m. As introduced earlier, fi(k) denotes the forward 
probability for sequence position k and state i and Pi(k, m) the probability of selecting state m as the 
previous state while being in state % at sequence position k. In the following, i,j,n G S, y G A and I G S 
denotes the sampled previous state. 

Initialization: at the start of the training sequence X and for all states m G S, set 

'-«» = {l z;l 

Tij(0,m) = 
Ei(y,0,m) = 

Recursion: loop over all positions k from 1 to L in the training sequence X and loop, for each such 
sequence position k, over all states m G 5\{0} = {1, . . . , M} and set 

M 



fm(k) = e m {x k ) ■ f n (k - 1) ■ t 

n.r, 



n=0 



p m (k,n) = 



fm(k) 

T it j{k,m) = Tij(k - 1,1) + 8i t i- S m j 
Ei(y,k,m) = E^y, k - 1, 1) + 8 lti ■ 5 VtX 
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Termination: at the end of the input sequence, i.e. for k = L, set 



M 



f M (L) 



n=0 



p M (L,n) 




Tij{L,M) = T^{L,l) + 5 hi -8M,3 
At the end, T^(X,Il) = T itj (L,M) and E?(y,X,IL) = Ei(y,L,M) (and Pi(X) = f M (L)), i.e. we know 



how often a transition from state i to state j was used and how often symbol y was read by state i in a 
state path II sampled from the posterior distribution P(X, II) in iteration q for sequence X. As is clear 
from the above description, the calculation of the f m , p m , Tij and Ei values for sequence position k 
requires only the respective values for the previous sequence position k — 1, i.e. the memory requirement 
can be linearized with respect to the length of the training sequence X. 

For an HMM with M states, a training sequence of length L and for every free parameter to be trained, we 
thus need O(M) memory to store the f m values, 0(MT max ) memory to store the p m values and O(M) 
memory to store the cumulative counts for the free parameter itself in every iteration, e.g. the Tij values 
for a particular transition from state i to state j. If we sample K state paths, we have to store the 
cumulative counts from different state paths separately, i.e. we need K times more memory to store the 
cumulative counts for each free parameter, but the memory for storing the f m and the p m values remains 
the same. Overall, if K state paths are being sampled in each iteration, we thus need O(M) memory to 
store the f m values, 0(MT max ) memory to store the p m values and O(MK) memory to store the 
cumulative counts for the free parameter itself in every iteration. For an HMM, the memory requirement of 
the new training algorithm is thus independent of the length of the training sequence. 
For training one free parameter in the HMM with the above algorithm, each iterations requires 
0(MT max L) time to calculate the f m and the p m values and to calculate the cumulative counts. If K state 
paths are being sampled in each iteration, the time required to calculate the cumulative counts increases to 
0(MT max LK), but the time requirements for calculating the f m and p m values remains the same. 
For sampling K state paths for the same input sequence and training one free parameter, we thus need 
0(M + MT max + MK) memory and 0(MT max LK) time for every iteration. If P parameters are to be 
trained and Q £ {1, . . . P} is the number of parameters to be trained in parallel, the memory requirement 
increases slightly to 0(M + MT max + MKQ) and the time requirement becomes 0(MT max LK ^) . The 
algorithm can therefore be readily adjusted to reduce the time as much as possible by using the maximum 
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amount of available memory . 

This can be directly compared to the algorithm described in 2.1 with requires O(ML) memory and 
0(MT max L + KLT max ) = 0(T max L(M + K)) time to do the same. Our new algorithm thus has the 
significant advantage of linearizing the memory requirement and making it independent of the sequence 
length for HMMs while increasing the time requirement only by a factor of ^f_^ K , i-e. decreasing it when 
only one state path K = 1 is sampled. 

A new linear-memory algorithm for Viterbi training 

Of the few HMM-based methods that provide automatic: algorithms for parameter training. Viterbi 
training [8] is probably the most popular. Similar to Baum- Welch training, it is an iterative training 
procedure. Unlike Baum- Welch training, it only considers a single state path, namely the Viterbi path, 
when deriving new sets of parameters. Whereas Baum- Welch training can be shown to find a local 
optimum of the likelihood, Viterbi training optimizes the likelihood for both parameters and the Viterbi 
state paths. In each iteration, a new set of parameter values is derived from the counts of emissions and 
transitions observed in the Viterbi paths [9] of the training sequences. The iterations stop when the Viterbi 
paths of the training sequences no longer change. The main reason why Viterbi training is popular is that 
is readily implemented if the HMM's predictions arc generated with the Viterbi algorithm. 
Section 2 introduced a new algorithm for parameter training, called posterior sampling training, as well as 
a linear memory algorithm for this type of training. The main difference with respect to Viterbi training is 
that we sample K state paths in a non-deterministic way from the posterior distribution P(II|X) rather 
than considering only a single Viterbi path. We can thus adapt the ideas of section 2 to devise a 
linear-memory algorithm for Viterbi training once we have realized that all we need to do is to replace the 
probabilistic sampling of a previous state by the deterministic choice of the previous state as the state from 
which the Viterbi matrix element at the current state and sequence position was derived. 
As in section 2, Ef(y,X, II) denotes the number of times that state i reads symbol y from input sequence 
X in state path II given the HMM with parameters from the q-th iteration and TJ? • (X, II) denote the 
number of times that a transition from state i to state j is used in state path II given the HMM with 
parameters from the q-th iteration. 

In the following, the superfix q will indicate from which iteration the underlying parameters derive. If we 
consider all N sequences of a training set X = {X 1 , . . . X N } and the Viterbi path 11" for each sequence X n 
in the training set, the recursion which updates the values of the transition and emission probabilities can 
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be written as: 



These expressions are strictly analogous to equations [3] and [3] in section 2. As before, these equations 
require that we already know the values of TfAX n , II™) and E?(y, X n , II™), i.e. how often each transition 
and emission is used in the Viterbi path 11™ for training sequence X n . 

One straightforward way to determine these T?j(X n ,IL n ) and Ef(y, X n , 11") values would be to calculate 
the two-dimensional Viterbi matrix for every training sequence X n , to then derive a Viterbi state path II™ 
from the Viterbi matrix using the well-known traceback procedure [9] and to simply count how often each 
transition and each emission was used in that Viterbi state path. Using this strategy, every iteration in the 
Viterbi training algorithm would require C(Mmaxi{Li} + max^lL^}) memory and 

0{MT max ^2n—i Li + Yli=i Li) time, where Yli=i ^ s the sum of the N sequence lengths in the training 
set X and max^jLi} the length of the longest sequence in training set X . However, for many 
Bioinformatics applications where the number of states in the model M is large, the connectivity T max of 
the model high or the training sequences are long, these memory and time requirements are too large to 
allow automatic parameter training using this algorithm. 

A linear-memory version of the Viterbi algorithm, called the Hirschberg algorithm [18], has been known 
since 1975. It can be used to derive Viterbi paths in linearized memory while increasing the time 
requirement by up to a factor of two. One significant disadvantage of the Hirschberg algorithm is that it is 
not as easy to implement as the Viterbi algorithm. Only a very few HMM-based applications in 
Bioinformatics actually employ it, see e.g. [1, 19] where the algorithm is used in conjunction with a 
pair-HMM. Using the Hirschberg algorithm, every iteration in the Viterbi training algorithm would thus 
require 0(M + maxi{Li}) memory and 0(MT max Xa=i + T^i=i ^i) time, where 53j=i 1S sum of 
the N sequence lengths in the training set X and max^jLi} the length of the longest sequence in training 
set X. However, as we will sec in the following, this can be further improved. 

Our previous observations (1) to (5) from section 2 that led to the linear-memory algorithm for posterior 
sampling training can be replaced by similar observations for Viterbi training: 

(VI) If we consider the description of the Viterbi algorithm [9], in particular the recursion, we realize that 
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the calculation of the Viterbi values can be continued by retaining only the values for the previous 
sequence position. 

(V2) If we have a close look at the description of the traceback procedure [9] , we realize that we only have 
to remember all Viterbi matrix elements at the previous sequence positions in order to deduce the 
state from which the Viterbi matrix element at the current sequence position and state was derived. 

(V3) If we want to derive the Viterbi path II from the Viterbi matrix, we have to start at the end of the 
sequence in the End state M. 

Observations (VI) and (V2) imply that local information suffices to continue the calculation of the Viterbi 
matrix elements (VI) and to derive a previous state (V2) if we already are in a particular state and 
sequence position, whereas observation (V3) reminds us that in order to derive the Viterbi path, we have 
to start at the end of the training sequence. Given these three observations, it is as in section 2 not 
obvious how we can come up with a computationally more efficient algorithm for training with Viterbi 
paths. In order to realize that a more efficient algorithm does exist, one also has to also note that: 

(V4) While calculating the Viterbi matrix elements in the memory-efficient way outlined in (VI), we can 
simultaneously keep track of the previous state from which the Viterbi matrix element at every 
current state and sequence position was derived. This is possible because of observation (V2) above. 

(V5) In every iteration q of the training procedure, we only need to know the values of T?j(X,H) and 
El(y, X, II), i.e. how often each transition and emission was used in each Viterbi state path II for 
every training sequence X, but not where in the Viterbi matrix each transition and emission was used. 

Given all observations (VI) to (V5), we can now formally write down a new, linear-memory algorithm 
which calculates T^(X, II) and Ef(y,X, II) in a computationally efficient way with linearizes the memory 
requirement and which is — compared to the Hirschberg algorithm — easy to implement. In order to 
simplify the notation, we describe the following algorithm for one particular training sequence X and omit 
the superfix for the iteration q, as both remain the same throughout the algorithm. As before, Tij(k,m) 
denotes the number of times the transition from state i to state j is used in a Viterbi state path that 
finishes at sequence position k in state m and where Ei(y, k) denotes the number of times state i read 
symbol y in a Viterbi state path that finishes at sequence position k in state m. In the following, 
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• Vi(k) denotes the Viterbi matrix element for state i and sequence position k, i.e. Vi{k) is the 
probability of the Viterbi state path, i.e. the state path with the highest overall probability, that 
starts at the beginning of the sequence in the Start state and finishes in state i as sequence position k 

and i,j,n € S, y € A and I e S denotes the previous state from which the current Viterbi matrix element 
v m (k) was derived. 

Initialization: at the start of training sequence X = 

v m (0) = 
T id (p,m) = 
Ei(y,0,m) = 

Recursion: loop over all positions k from 1 to L in the training sequence X and loop, for each such 
sequence position k, over all states m e 5\{0} = {1, . . . , M} and set 

v m (k) = e m (x k ) ■ max{vi(k - 1) • t Lm } 
Tij(k,m) = T id (k - 1,1) + Sis ■ S mJ 
Ei(y,k,m) = Ei(y,k -1,1) + 5i fi ■ 6 ViXk 

Termination: at the end of the input sequence, i.e. for k = L, set 

v M {L) = max • t riiM } 
Tij(L, M) - T id (L,l) + 5i,i -S M ,j 

At the end, T? tj (X,IL) = T itj (L,M) and E?(y,X,U) = E t {y,L,M) (and P«(X,IL) = v M (L)), i.e. we know 
how often a transition from state i to state j was used and how often symbol y was read by state i in 
Viterbi state path II in iteration q. As is clear from the above description, the calculation of the v m , T id 
and Ei values for sequence position k requires only the respective values for the previous sequence position 
k — 1, i.e. the memory requirement can be linearized with respect to the length of the training sequence X. 
For an HMM with M states and a training sequence of length L and every free parameter of the HMM 
that we want to train, we thus need in every iteration O(M) memory to store the v m values and O(M) 
memory to store the cumulative counts for the free parameter itself, e.g. the T id values for a particular 
transition from state i to state j. For an HMM, the memory requirement of the new training algorithm is 
thus independent of the length of the training sequence. 



(xi, . . . , x L ) and for all meS, set 

f 1 TO=0 

\ m^O 
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For training one free parameter in the HMM with the above algorithm, each iterations requires 
0(MT max L) time to calculate the v m values and to calculate the cumulative counts. If P parameters are 
to be trained and Q £ {1, . . . P} is the number of parameters to be trained in parallel, the memory 
requirement increases slightly to 0(M + MQ) and the time requirement becomes 0(MT max L^) . As for 
the posterior sampling algorithm, the linear-memory algorithm for Viterbi training can therefore be readily 
adjusted to reduce the time as much as possible by using the maximum amount of available memory . 
This can be directly compared to the algorithm described above with first calculates the entire Viterbi 
matrix and which requires 0{ML) memory and 0(MT max L + LT max ) = 0(T max L(M + 1)) time to do the 
same. Our new algorithm thus has the significant advantage of linearizing the memory requirement and 
making it independent of the sequence length for HMMs while keeping the time requirement essentially the 
same. It is thus as memory efficient as Viterbi training using the Hirschbcrg algorithm, while being more 
time efficient and significantly easier to implement. 

Practical evaluation of the different training methods 

The previous sections discuss the theoretical advantages and disadvantages of the different parameter 
training methods in detail. They are summarized in Table 1. In order to investigate how well the different 
methods do in practice, we implemented Viterbi training, Baum- Welch training and posterior sampling 
training for the well-known example the dishonest casino [8] . This casino consists of a fair and a loaded 
dice. The fair dice generates numbers from A = {1, 2, 3, 4, 5, 6} with equal probability, whereas the loaded 
dice generates the same numbers in a biased way. The properties of the dishonest casino are readily 
captured in a four-state HMM shown with 8 transition and 12 emission probabilities, six each for each of 
the two non-silent states. Fixing the values of the transition probabilities from the Start to two non-silent 
states and those from the two non-silent states to the End state leaves us to choose the value of four 
transition probabilities. Due to the constraint that J^ieS ^(^j) = 1 f° r an i & however, we only have to 
train two independent transition probabilities. Together with the emission probabilities of the two 
non-silent states, we thus have to train 14 independent parameters. 

We implemented each of the three training methods using the most time and memory efficient versions, i.e. 
using the linear-memory algorithm for Baum- Welch training [16, 17], the linear-memory algorithm for 
Viterbi training presented here as well as the linear-memory algorithm for the new type of posterior 
sampling training introduced here. In order to investigate how strongly the outcome of the three training 
methods depends on the initial parameter values, we generated three sets of randomly chosen initial 
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parameter values which we presented to every training method. Our set of training sequences consists of 
10 sequences of length 30000 each which were generated using the original HMM of the dishonest casino. 
Using the original HMM of the dishonest casino we generated a further 20 sequences of length 30000 each 
to be used as test set for three- fold cross-evaluation. 

Performance of the three training methods 

Our first goal was to investigate how the prediction performance varies as function of the number of 
iterations. Figure 1 shows the performance as function of the number of iterations for all three training 
methods. We tried to reduce other factors which may affect the comparison and thus decided not to use 
pseudo-counts. For posterior sampling training, we also investigate how the performance depends on the 
number of state paths sampled for every training sequence in every iteration. 

Viterbi training terminated after a few iterations and with a very low performance due to too many 
parameters being set to zero. In Viterbi training, a parameter is set to zero (and remains zero) as soon as a 
the respective transition or emission does not occur in any Viterbi path of one iteration. The results for 
Viterbi training arc therefore not included into Figure 1. The performance for Baum- Welch training and 
posterior training increases with the number of iterations, the most rapid increase in performance being 
made between 30 and 60 iterations. All three variants of posterior sampling training result in essentially 
the same performance, i.e. it does not matter whether one, three or five state paths are sampled for every 
training sequence in every iteration. Posterior sampling training outperforms Baum- Welch training by 
reaching a higher performance within fewer iterations and by having essentially the same peak performance 
as Baum- Welch training within error bars. The size of the error bars in Figure 1 shows that the 
performance of each training method does not strongly depend on the initial set of randomly chosen 
parameter values and that these difference further decrease with more iterations. 

Parameter convergence of the three training methods 

The objective of parameter training is not only to get a high prediction performance, but also to recover 
the parameter values of the model which was originally used to generate the training sequences. We 
therefore investigated how well the trained parameter values converge to the reference parameter values. 
Figure 2 shows the average root mean square difference (RMSD) between the trained and the reference 
transition probabilities and the average root mean square difference between the trained and the reference 
emission probabilities for every training method. The average is taken over three training rounds, each 
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starting with a different set of randomly chosen initial parameter values. 

Viterbi training terminated after a few iterations due to too many parameters being set to zero and the 
corresponding results are thus not included in Figure 2. The initial RMSD values for the transition 
probabilities are all higher than those for the emission probabilities as there are fewer transitions 
probabilities that have to fulfill any constraint of type Y]j U.j — 1 than there are emission probabilities 
ei(y) that have to fulfill any constraint of type ^2 y ei(y) = 1, in other words, the RMSD between the 
randomly chosen initial values and the reference values of the transition probabilities tend to be larger than 
the RMSD between the randomly chosen initial values and the reference values of the emission 
probabilities. As Figure 2 shows, all three variants of posterior sampling training outperform Baum- Welch 
training in terms of parameter convergence, not only in terms of speed of convergence, but also in terms of 
asymptotic parameter convergence: after the maximum number of iterations, the RMSD values for the 
Baum- Welch algorithm are 0.004 (transitions) and 0.004 (emissions), whereas those for the posterior 
sampling training arc 0.002 (transitions) and 0.001 (emissions). Posterior sampling training is thus better 
at recovering the reference model's parameters than Baum- Welch training. The parameter convergence is 
essentially the same for all three variants of posterior sampling training, i.e. independent of whether one, 
three or five state paths are being sampled for every training sequence in every iteration. The CPU time 
required for every iteration using Baum-Welch training was 18.49 s compared to 8.06 s (1 state path), 
10.87 s (3 state paths) and 15.73 s (5 state paths) for posterior sampling training, i.e. posterior sampling 
algorithm was always faster than Baum-Welch training. 

To summarize, posterior sampling training produces significantly better results than Baum-Welch training 
both in terms of prediction performance and parameter convergence. For our HMM, posterior training 
works as well whether we sample one or five state path per training sequence and iteration. Viterbi 
training is the least successful of the three investigated training methods. It is not robust with respect to 
low counts for some transitions and emissions and terminates after a few iterations with a low performance. 
Posterior sampling training by sampling only a single state paths from the posterior distribution thus turns 
out to be a considerably more robust and reliable parameter training strategy than Viterbi training which 
also takes only one state path into account. Generally, and for more complex hidden Markov models and 
hidden Markov models with more variation in the values of their transition and emission probabilities, we 
expect a larger number of sampled state paths to improve parameter training. 
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Conclusion and discussion 

A wide range of Bioinformatics applications is based on hidden Markov models. The prediction 
performance of these applications depends to a large extent on the values of the model's transition and 
emission probabilities. For most applications, these values have been chosen, often manually, for a 
particular data set. Unless we have reliable ways of automatically training the parameter values for new 
data sets, we may not only limit the predictive power for one particular data set, but also have no way of 
adapting the same application to new data sets. Many existing applications would thus gain much wider 
applicability if they could be readily retrained for new data sets. Only few of the many existing 
HMM-based applications implement parameter training methods and those who do typically implement 
Viterbi training. 

We here introduce a new method for parameter training, called posterior sampling training, which samples 
state paths from the posterior distribution to train parameter values in an iterative procedure. We present 
a computationally efficient algorithm for posterior sampling training which linearizes the memory 
requirement and which is comparatively easy to implement. We not only study the theoretical properties of 
the new algorithm, but also investigate its practical properties by implementing it for the well-known 
example of the dishonest casino and by comparing it to Baum- Welch and Viterbi training. 
Our studies show that Baum- Welch training and posterior sampling training provide a significantly more 
robust and reliable way of training parameter values than Viterbi training. The new training method of 
posterior sampling is superior to Baum- Welch training as it requires fewer iterations to reach the same 
maximum performance and as the parameters values converge faster to their reference values. This holds 
even when only one state path is sampled for every training sequence in every iteration, thereby making 
posterior sampling training also more time and memory efficient than Baum- Welch training. 
We hope that the algorithms and the new parameter training method presented in this study will encourage 
the implementation of automatic parameter training methods for more Bioinformatics applications. 
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Figures 

Figure 1 - Performance 

The average performance as function of the number of iterations for the different training methods. The 
performance is defined as the product of the sensitivity and specificity and the average is the average of 
three training rounds, each starting with a different set of randomly chosen initial parameter values. For 
posterior sampling training, a fixed number of state paths were sampled for each training sequence in each 
iteration (one state path = Posteriorsampling 1, 3 state paths = Posteriorsampling 3, 5 state paths = 
Posteriorsampling 5) . The results of Viterbi training are not included in the plot as the algorithm 
terminated after a few iterations, see the text for more information. The error bars for each method 
correspond to the standard deviation of the performance from the three training rounds. 




Figure 2 - Parameter convergence 

Parameter convergence as function of the number of iterations for the different training methods. For every 
training method, there are two lines, one line showing the average root mean square difference (RMSD) for 
the transition probabilities and one line showing the average root mean square difference for the emission 
probabilities. The average is taken over three training rounds, each starting with a different set of 



20 



randomly chosen initial parameter values. The set of lines with higher RMSD values corresponds to the 
transition probabilities and the set of lines with lower RMSD values to the emission probabilities. For 
posterior sampling training, a fixed number of state paths were sampled for each training sequence in each 
iteration (one state path = Posteriorsampling 1, 3 state paths = Posteriorsampling 3, 5 state paths = 
Posteriorsampling 5) . The results of Viterbi training are not included in the plot as the algorithm 
terminated after a few iterations, see the text for more information. 




Tables 

Table 1 - Computational requirements 

Overview of the time and memory requirements for Viterbi training, Baum- Welch training and posterior 
sampling using different algorithms for an HMM with M states and one iteration. T max is the connectivity 
of the HMM, L the length of the single training sequence and K the number of state paths sampled in each 
iteration for every training sequence. 
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type of training 


type of algorithm 


time requirement 


memory requirement 


reference 


Viterbi training 


Viterbi algorithm 
Hirschbcrg algorithm 
Lam-Meyer algorithm 


0{MT max L) 
0(MT max L) 
0(MT max L) 


0{ML) 

0{M) 

0(M) 


[9] 
[18] 

this paper 


Baum- Welch 


Baum- Welch algorithm 
Miklos-Mcycr algorithm 


0(MT max L + L) 
0(MT max L) 


O(ML) 
0(M) 


[8] 
[16] 


posterior sampling 
training 


forward & backtracing 
Lam-Meyer algorithm 


0{MT max L + LT max K) 
0{MT max LK) 


O(ML) 

0(M + MT max + MK) 


this paper 
this paper 
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